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Abstract 

Background: The robust identification of isotope patterns originating from peptides being analyzed through mass 
spectrometry (MS) is often significantly hampered by noise artifacts and the interference of overlapping patterns 
arising e.g. from post-translational modifications. As the classification of the recorded data points into either 'noise' or 
'signal' lies at the very root of essentially every proteomic application, the quality of the automated processing of mass 
spectra can significantly influence the way the data might be interpreted within a given biological context. 

Results: We propose non-negative least squares/non-negative least absolute deviation regression to fit a raw 
spectrum by templates imitating isotope patterns. In a carefully designed validation scheme, we show that the 
method exhibits excellent performance in pattern picking. It is demonstrated that the method is able to disentangle 
complicated overlaps of patterns. 

Conclusions: We find that regularization is not necessary to prevent overfitting and that thresholding is an effective 
and user-friendly way to perform feature selection. The proposed method avoids problems inherent in 
regularization-based approaches, comes with a set of well-interpretable parameters whose default configuration is 
shown to generalize well without the need for fine-tuning, and is applicable to spectra of different platforms. The R 
package IPPD implements the method and is available from the Bioconductor platform (http://bioconductor.fhac. 
org/help/bioc-views/devel/bioc/html/IPPD.html). 



Background 

Mass spectrometry (MS), often in conjunction with high 
performance liquid chromatography (HPLC), is the de- 
facto standard analytical tool to derive important bio- 
logical knowledge about the protein content of whole 
cells, organelles, or biomedical samples like tumour or 
blood plasma. Within a typical experimental setup, puri- 
fied proteins of the sample under study are digested by 
an enzyme. Before entering the mass spectrometer, pep- 
tides are separated chromatographically according to their 
physico-chemical properties in order to avoid a massive 
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overlapping of peptide signals within a single scan. Nev- 
ertheless, due to the sheer number of peptides present in 
a sample, interfering patterns still occur frequently, not 
least because of post-translational modifications such as 
the deamidation of asparagines or glutamine residues. In 
order to obtain an unambiguous assignment of the sig- 
nals, and in particular their isotope patterns, which is a 
prerequisite for a proper identification and quantification, 
every data point in m/z dimension is classified either as 
signal' or as 'noise' during the so-called feature detection 
phase. As this processing lies at the very root of every 
proteomic application, the quality of feature detection can 
have dramatic impact on the finally derived results and 
conclusions. In view of the large amount of data even a 
single MS experiment can produce, automated analysis is 
indispensable. However, due to various artifacts arising 
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from electric and chemical noise and baseline trends, the 
identification of isotope patterns is error-prone and time 
consuming. In addition, severe overlaps of peptide sig- 
nals within the same mass spectrometric scan can hamper 
a straightforward analysis furthermore. In recent years, 
numerous procedures have been developed to process 
this data (cf., e.g., [1-8]). Within this paper, we propose a 
novel method that is demonstrated to perform especially 
well in challenging situations, characterized e.g. by strong 
local variations in noise and intensity levels or the pres- 
ence of isotope patterns of different charges exhibiting 
overlap, which in many cases may be difficult to resolve 
even for a human expert by visual inspection. Existing 
software typically depends on a large set of parameters 
requiring careful fine-tuning, often being rather sensitive 
to changes in the measurement process like the change 
of the platform, which makes a proper parameter choice 
a laboursome task. In contrast, the proposed method has 
been designed to depend on a comparatively small set 
of well-interpretable parameters whose default configura- 
tion is shown to be robust, yielding mostly excellent, but at 
least competitive performance on spectra of different plat- 
forms. In a nutshell, our method uses non-negative least 
squares or non-negative least absolute deviation regres- 
sion to fit a spectrum s by a large dictionary of templates 
mimicking isotope patterns; since true positions and 
charges of isotope patterns in the spectrum are unknown 
in advance, regions where the signal exceeds a local mea- 
sure of noise are identified and then a vast set of templates 
is placed in those regions. In the spirit of sparse recov- 
ery, a small subset of the templates, which reasonably 
explains the observed signal, is selected by applying hard 
thresholding with a locally adaptive choice of the thresh- 
old to the regression coefficients obtained previously. Our 
method is related to a formerly proposed template-based 
approach (NITPICK, [3]). As opposed to the present 
work, NITPICK uses i\ -regularized non-negative least 
squares. Without non-negativity constraints, this proce- 
dure is known as the lasso [9]. Reference [10] contains 
the first application of the lasso to the problem studied 
in the present paper. Given a dramatic increase in occur- 
rence of high-dimensional datasets in recent years and the 
resulting need for feature selection, the lasso, due to com- 
putationally and theoretically appealing properties, has 
meanwhile become so popular that it can be regarded as a 
standard tool of modern data analysis [11]. In this respect, 
NITPICK follows the usual paradigm suggesting that l\- 
regularization is the method of choice. In the present 
paper, we argue for a deviation from that paradigm mainly 
in view of the following two aspects. First, a major bene- 
fit of our fitting+thresholding approach is that parameter 
choice is more user-friendly, since the threshold can be 
interpreted in terms of a signal-to-noise ratio. This is 
unlike the regularization parameter of the lasso, which can 



in general not be related directly to the signal. In the pres- 
ence of heterogeneous noise and model misspecifications, 
the right amount' of regularization is notoriously diffi- 
cult to choose. Second, there is a substantial body of work 
showing that non-negativity constraints alone may suffice 
to recover a sparse target. Non-negative least squares + 
thresholding is analyzed in [12], where it is shown that it 
can significantly outperform the usual i\ -approach with 
respect to sparse recovery. See Section "Sparse recov- 
ery with non-negativity constraints: non-negative least 
squares + thresholding vs. the non-negative lasso" for a 
detailed discussion. 

Methods 

A spectrum is understood as a sequence of pairs 
{fejOl/Lp where %{ = mi/Zi is a mass (m/, measured in 
Dalton Da) to charge (zi), and yt is the intensity, i.e. the 
abundance of a particular mass (modulo charge state), 
observed at xu i = l,...,n, which are assumed to be 
ordered increasingly. 

Template model 

The (yi)" = i = y are modeled as a positive combination of 
templates designed on the basis of prior knowledge about 
peak shape and composition of isotope patterns. If our 
model were perfectly correct, we could write 

c 

y=*fi* = J2*cfi* c , *c=[<Pc,i...<Pc, Pc }> c=l,...,C, 

c=l 

(1) 

where O is a non-negative matrix of templates and /?* 
is a non-negative coefficient vector. Both O and /?* can 
be arranged according to charge states c = 1, . . . , C. 
Each sub-matrix 4> c can in turn be divided into columns 
<Pc,i> - - -'Vcpc* wnere the entries of each column vector 
store the evaluations of a template (p Ct j, j = 1, . . . ,p c , at 
the Xi, i = 1, . . . , n. It is assumed that only a small fraction 
of the templates in 4> are needed to represent the signal, 
i.e. /?* is highly sparse. The templates are of the form 

Pc,; = ^2 a cj,k fc,j,k,O c ,j> ( 2 ) 
keZ 

where the ^ c ,j,k are functions representing a single peak 
within an isotope pattern, depending on a location m Ci j 
and a parameter vector 0 Cf j. In general, peaks can be 
modeled by Gaussian, Lorentzian, and sech 2 shapes, cf. 
[13]. Due to their similarity, we restrict ourselves to the 
Gaussian, but provide in addition the exponentially mod- 
ified Gaussian (EMG, cf., e.g., [14]), a model for a possi- 
bly skewed peak as occuring frequently in MALDI-TOF 
recordings, where late ion formation in the gas phase 
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leads to tailed peaks [15]. The EMG is parameterized by 

0 C) j — (a C) j,a c> j, [i C) j) T e IR + x IR + x R (for a c ,j I 0, one 
obtains a Gaussian) 

i (\ 1 / a <>j . M ^ ~ ( * ~ m ^ k) \ (*\ 
fc,j,k( x ) =— ex P ( ^"T" + 7. I ( 3 ) 



2a 



x 1 -F 



cr c ,j Vcj ~(x- m c j )k ) 



J c,J 



fW= /-oo7fe eXP ("T)^ 



In (2), the nonnegative weights a Ct j t k equal the height of 
the isotopic peak k within the pattern / of charge state c. 
These heights are computed according to the averagine 
model [16]. The rn c ,j } k are calculated from m C) j as m C) j } k = 
rn C) j + K k , where k usually ranges between 1.002 and 1.008 
Dalton, see e.g. [17]. Note that in Eq. (2) the location of the 
most intense peak (# c ,/,o = max^ a Cj j^) is taken as charac- 
teristic location of the template instead of using the finally 
reported monoisotopic position: we set m c , ; ;o = Wc,j so 
that the remaining m Cj/ ^, k ^ 0, are computed by shift- 
ing m C) j in both directions along the m/z axis. With the 
normalization max x (p c ,j(x) = 1 for all c,j, the entries of f$* 
can be interpreted as intensities of the most intense peaks 
of the templates. The construction scheme is illustrated in 
Figure 1. 

Parameter estimation 

The parameters 0 Ct j = (a C) j, a C) j, /x c> y) T of the peaks (3) 
are unknown in practice. Following a central paradigm of 
our framework, which is to relieve the user of performing 
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Figure 1 Template model. Illustration of the template construction 
(charge state c = 2) for an EMG peak shape with a moderately strong 
right tailing. 



laboursome fine-tuning of parameters, we have developed 
a systematic procedure automatically providing estimates 
of these parameters, which is considerably more efficient 
and flexible than a grid search. For instance, the param- 
eters may additionally depend on the m/z-position. Our 
framework for parameter estimation extends a conceptu- 
ally similar approach in [18] designed for a Gaussian peak 
shape. 

In a first step, we apply a simple peak detection algo- 
rithm to the spectrum to identify disjoint regions 1Z r C 
{1, . . . , n}, r = 1, ... ,R, of well-resolved peaks. For 
each region, we fit the chosen peak shape to the data 
{(xi>yi)}ien r using nonlinear least squares: 



min^O'r 

ieU r 



(4) 



yielding an estimate 0 r (x r ), where ^c r denotes an estima- 
tion for the mode of the peak in region 1Z r . This concept 
is sketched in Figure 2. The nonlinear least squares prob- 
lem (4) is solved by using a general purpose nonlinear 
least squares routine available in most scientific comput- 
ing environments, e.g. nls in R. Once the sequence of 
estimators {0 r (x r )} has been obtained, they are subject 
to a suitable aggregation procedure. In the simplest case, 
one could simply take averages. For spectra where peak 
shape characteristics, in particular peak width, are known 
to vary systematically with m/z position, we use the pairs 
{(x r ,O r (x r ))} as input into a linear regression procedure 
to infer the parameters of pre-specified trend functions. 
Formally, we model each component 0/ of 0 as a linear 
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Figure 2 Parameter estimation. Illustration of peak parameter 
estimation. The figure displays a well-resolved peak in the region 
H = {i: 941. 3 Th < x-, < 941.9 Th}. In this example, the size of H 
equals 33, i.e. there are 33 pairs {(x/,y,)} that enter a nonlinear least 
squares problem of the form (4). Under the assumption of an EMG 
model, the resulting fit is indicated by a solid line. 
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combination of known functions gi >m of x = m/z and an 
error component £/, i.e. 

Mi 

Oi(x) = ^2 v l,mgl,m(x) + ei(x), (5) 

m=l 

for which a linear trend i.e. 6[(x) = v/j + v^x, is one of the 
most common special cases. In [19], a set of instrument- 
specific models for the peak width is provided, all of which 
can be fitted by our approach. 

We refrain from using least squares regression to deter- 
mine the parameters in (5) due to its sensitivity to possible 
outliers, which arise from poorly resolved, wiggly or over- 
lapping isotope patterns, which may affect the quality 
of the estimates 0 r . Therefore, the linear model is fit- 
ted in a robust way by using least absolute deviation 
regression. Given the resulting estimates of the parame- 
ters m/z- specific estimates for the parameters in (3) 
are obtained by evaluating (5). 

Template fitting 

The computation of the design matrix 4> requires a set 
of m/z positions at which templates are placed. In gen- 
eral, one has to choose positions from the interval 
We instead restrict ourselves to a suitable subset of the 
set {xiYl =l . ^ ne deviations from the positions of the 
true underlying isotope patterns is then at least in the 
order of the sampling rate, but this can be improved 
by means of a postprocessing step described in Section 
"Postprocessing and thresholding". Using the whole set 
{xi}*l=i may be computationally infeasible if n is large 
and is in fact not necessary since isotope patterns occur 
very sparsely in the spectrum. Therefore, we apply a pre- 
selection step on the basis of what we term local noise 
leveF (LNL). The LNL is denned as the median of the 
intensities yi falling into a sliding window of fixed width 
around a specific position. For x e[xi,x n ], we define the 
local noise level based on sliding window width h as 

LNLOO = median({j/; : i e l x })> (6) 
X x = {i: Xi e[x-h,x + h]}. 

Given the LNL, we place templates at position X{ (one 
for each charge state) if the corresponding yi exceeds 
LNL(#;) by a factor factor . place. Section "Finding a 
set of default parameters" describes how we determined 
defaults for the two parameters h and factor. place. 
In fact, the LNL is a central quantity in our framework, 
because it does not only influence the placement, but also 
the selection of templates (see Section "Postprocessing 
and thresholding" below). Choosing h too small typically 
has the effect that the LNL is overestimated such that true 
peaks might be incorrectly classified as noise. Conversely, 
choosing h too large leads to an underestimation, thereby 



increasing the computational burden as well as the num- 
ber of spurious patterns included in the final list. The 
advantages of working with the median are obvious: easy 
computation, robustness and equivariance with respect to 
monotone transformations. Similar notions of local noise 
can be found in the literature, see e.g. [8] where a trun- 
cated mean is used. Given the positions of the templates, 
we generate the matrix 4> according to Eqs. (1) and (2). In 
the fitting step, we compute a non-negative least squares 
(q = 2) or alternatively non-negative least absolute devi- 
ation (q = 1) fit by determining a minimizer /? of the 
criterion 

mm\\y-&p\\l, q = 1 or q = 2, (7) 

The optimization problem (7) is a quadratic (q = 2) or 
linear (q = 1) program and is solved using interior point 
methods (e.g. [20]). The details are relegated to Appendix 
"Fitting with non-negativity constraints" section. As far as 
the choice of q is concerned, we point out that q = 1 
yields a robust fit that can deal better with deviations from 
model assumptions, i.e. deviations from the averagine 
model or from the peak model. However, in general, we 
are unable to provide any recommendation about how to 
choose q. Therefore, in our validation, both are evaluated. 

Comparison with pepex 

In prior work [21], subsequently referred to as pepex', 
non-negative least squares fitting is used as well. An 
important difference to our approach is that the matrix 
4> is not constructed from the convolution of isotope 
distributions and peak shapes as described in Section 
"Template model". Instead, peak detection is applied first 
to reduce the raw intensity data to peak clusters, a step 
that is usually referred to as centroiding. At the second 
stage, called de-isotoping, peak clusters are fitted by a 
design matrix containing isotope distributions them- 
selves, not convolved versions. While the approach is 
computationally more attractive and avoids estimation of 
peak shape parameters (cf. Section "Parameter estima- 
tion"), the division into centroiding and de-isotoping may 
lead to poor performance for low resolution and noisy 
data, or in the presence of overlapping patterns. In these 
cases, peak detection is little reliable. In our template- 
based approach, there is no separation of centroiding and 
de-isotoping. It performs much better in the aforemen- 
tioned cases, since it operates directly on the data and is 
hence less affected if single peaks of a pattern are difficult 
to detect. This reasoning is supported by our evaluation in 
Section "Results and discussion" as well as that in [3]. At 
the same time, our approach can in principle be applied to 
centroided spectra as well. In this case, the columns of the 
matrix 4> directly represent isotope distributions instead 
of isotopic patterns. 
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Postprocessing and thresholding 

While indeed a considerable fraction of the entries of 
P are precisely equal to zero, treating all positions for 
which the corresponding entry differs from zero as loca- 
tions of isotope patterns would yield a huge number of 
false positives, at least because of regions, in which noise 
fitting reduces the objective in (7). Therefore, the fitting 
step of the previous section is accompanied by a thresh- 
olding step, with the aim to separate signal from noise. 
However, fitting followed by thresholding alone does not 
lead to a proper output. The strategy could be successful 
if our template model were free of any kind of misspec- 
ification. Even when neglecting possible misfits of the 
averagine model, we still have to cope with two sources 
of systematic errors — a limited sampling rate and mis- 
matches in the peak model. These are the main reasons 
for what we term peak splitting', referring to the phe- 
nomenon that several templates are used to fit precisely 
one pattern. Figure 3 illustrates the effect of sampling in a 
noiseless setting. In the top panel, the signal is sampled in 
such a way that the top of the peak is lost. When placing 
two templates at the two sampling points xu x u of maxi- 
mum signal, non-negative least squares fitting attributes 
weights f3[, p u of roughly equal size to the templates. The 
postprocessing procedure outlined below yields a suitable 
correction. One might object that peak splitting' is a prob- 
lem inherent in our entirely fitting-oriented approach (7) 
not incorporating any form of regularization. The bottom 
panel of Figure 3 shows the solution path of the non- 
negative lasso [22] given by {/J(A.), k > 0}, 0(A) = 
argmin jg>0 || y — ^jfff^ + A1 T /?. One obtains two nearly 
parallel trajectories, demonstrating that only a heavily 
biased fit, which would undesirably lead to the exclu- 
sion of additional smaller signals, could accomplish the 
selection of only one template. 

To a large extent, peak splitting' can be corrected by 
means of the following merging procedure, which we 
regard as postprocessing of the fitting step (7) and which 
we applyjDrior to thresholding. Given an estimate ft, we 
define M c = {m c ,j ♦ Pc,j > 0} C c = 1, . . . , C, as 

the set of all template locations where the corresponding 
coefficient exceeds 0. 

1. Separately for each c, divide the sets A4 C into groups 
Ge,i> • • • > Qc,G c of 'adjacent' positions. Positions are 
said to be adjacent if their distance on the m/z scale 
is below a certain tolerance ppm specified in 
parts-per-million, cf. Section "Finding a set of default 
parameters". In the context of 'peak splitting', the 
templates at locations sharing the same group are 
assumed to fit precisely one true underlying peak. 

2. With the notation of Eq. (2), for each c = 1, . . . , C, 
and forg= 1, . . . , G c , we solve the following 
optimization problem. 
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Figure 3 Peak splitting. Lower panel: Non-negative least squares fit 
of the sampled signal with and without postprocessing. Upper panel: 
Solution path of the non-negative lasso for the same data. 



(fn c ,g>Pc,g) = argmin / I ^ Pc,jfm c ,j(x) - P c ,gfm Cjg (x) J dx, 

Z C ' 8 -oo W&* ^ / 

Peg 

(8) 

with the aim to find a location m C)g and a weight /3 C)g 
of the most intense peak fe C)<? within an isotope 
pattern cp C)g approximating the fit of the most intense 
peaks {^m C)7 • wic,] £ Qc,g\ within the isotope patterns 
{cp Cf j : m Cf j e Q C)g } best in a least squares sense. 
3. One ends up with sets M c — {™-c,g\ G g L\ an< ^ 
coefficients {ficg^Lv c ~ h • • - >C, 

The additional benefit of solving (8) in step two as com- 
pared to the selection of the template with the largest 
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coefficient within each group as proposed in [3] is that 
we are able to determine the location of the pattern even 
more accurately as predetermined by a limited sampling 
rate, since in (8) we optimize the location over a con- 
tinuum. The optimization problem (8) can be solved fast 
and accurately by sampling the integrand on a fine grid of 
points and then solving a nonlinear least squares problem 
with optimization variables m C)g and /3 C}g . 

All candidate positions (fn c ,g> Peg) are assigned a signal- 
to-noise ratio 



ration = GOF+(rh c>g ) 



LNL + (m C)g ) 



(9) 



where LNL + (m C) ^) = max {LNL(m C) ^), ^median 
({LNL( > %/)}^ 1 )} is a truncated version of the local noise 
level, with a lower bound included to avoid that the deno- 
minator in (9) takes on tiny values in low-intensity regions. 
The factor GOF + (m c ^) represents a goodness-of-fit 
adjustment, a correction which aims at downweighting 
spurious peaks in low-intensity noise regions. These are 
not hard to distinguish from signal regions, which, in view 
of the presence of peak patterns, tend to be considerably 
regular. In order to spot noise regions, we fit the spectrum 
by single peaks (3) placed at each datum xu i = l,...,n, 
where the peak shape model, the associated peak 
shape parameters and the parameter q are chosen 
according to the choice made for template generation 
(Sections "Template model") and template fitting (Section 
"Template fitting"), respectively. Denote the residuals 
of the resulting fit by {rj}" =1 . A local measure of 
goodness-of-fit is defined by 



GOF + (#) = min 



ieX x 



,0.5 



The idea underlying this procedure is that in noise 
regions, the fit to the data will be poor, and consequently, 
the size of the residuals is expected to be large relative to 
the signal, hence leading to a low goodness-of-fit statistic. 
The truncation at 0.5 limits the influence of this correc- 
tion. A final list is generated by checking whether the 
signal-to-noise ratios (9) exceed a significance threshold' 
t specified by the user. We do not give a general guideline 
for choosing t, because a reasonable choice is very specific 
to experimental conditions, e.g. the platform used and the 
composition of the spectrum. It is important to note that 
while t itself is constant, we take into account that the 
noise level is heterogeneous, since thresholding is based 
on the ratios (9), where the local noise level enters. 

Finding a set of default parameters 

Apart from the signal-to-noise threshold t, we have 
introduced the parameters window, i.e. the width h of 
the sliding window required for the computation of the 



local noise level (6), the template placement parame- 
ter factor. place and the parts-per-million tolerance 
ppm within which peaks are considered to be merged 
by the postprocessing procedure. With the exception of 
the threshold t, we have fixed all parameters to a default 
setting which we expect to give reasonable (albeit poten- 
tially suboptimal) results on spectra different from the 
ones analyzed here, without the need of manual tuning. 
In order to find such a default setting, we performed a 
grid search using only one selected spectrum of those 
described in Section "Datasets" below. While our default 
setting, which can be found in the HTML manual of the R 
package IPPD, already performs well, we recommend to 
do such a calibration to optimize the performance of our 
method. 

Sparse recovery with non-negativity constraints: 
non-negative least squares + thresholding vs. the 
non-negative lasso 

We believe that our preference for the first alternative is 
a major methodological contribution that has potential to 
impact related problems where non-negativity problems 
come into play. In the present section, we provide, at a 
high level, a series of arguments rooting in the statistics 
and signal processing literature that clarify our contribu- 
tion and support our preference. 

Linear models and usual paradigms in statistics 

The fact that we favour non-negative least squares + 
thresholding may seem implausible since it questions 
or partially even contradicts paradigms about high- 
dimensional statistical inference. Consider the linear 
model 



or, 



ye. 



(10) 



which corresponds to model (1), where is used instead 
of to account for stochastic noise or model misspec- 
ifications. Linear models of the form (10) have been and 
continue to be objects of central interest in statistical 
modelling. 

• Classical work in statistics shows that under mild 
conditions if the number of sample n grows at a 
faster rate than the number of features p, the 
ordinary least squares estimator —> /?* (in 
probability) as n — »> oo. 

• Since many contemporary datasets, like the MS 
datasets of the present paper, are characterized by a 
large p, which is of the same order as n or even larger, 
the first bullet has considerably lost relevance. 
Translated to MS datasets, it provides a statement 
about the case where the resolution tends to infinity. 
Therefore, modern statistical theory studies regimes 
in which p is allowed to grow at a faster rate than n, 
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with a focus on results that hold for finite sample 
sizes. These results hinge on some sort of sparsity 
assumption on /?*, the simplest being that /?* is zero 
except for some index set (support) of small 
cardinality. In this context, a multitude of results has 
been proved (see e.g. [23] for an overview) indicating 

that the lasso estimate P is a statistically optimal 
procedure in the sense that if the regularization 
parameter is chosen in the right way, the squared 

Euclidean distance \\fi — /?* \\\ is nearly of the 
same order as that of an estimator one could 
construct if the non-zeroes of /?* were known. 

The second bullet provides quite some justification 
for NITPICK, which is based on the lasso. However, as 
detailed below, the italicized part can be critical On the 
other hand, there are several results that support our 
approach. 

The power of non-negativity constraints 

• It turns out that the non-negativity constraint > 0 
imposed in non-negative least squares (NNLS) may 
lead to a drastically better performance than that of 
the ordinary least squares estimator in large p 
situations provided 4> satisfies additional conditions. 
Roughly speaking, it is shown in [12] that if O has 
non-negative entries, which is fulfilled for the 
template matching problem of Section "Template 
Model", the NNLS estimator /$ does not overflt and is 
unique even in the singular case (p > n). These 
results indicate that NNLS may behave surprisingly 
well in a high-dimensional setup, without using 

i\ -regularization, which is often propagated in the 
literature as basically the only option ([24], Section 
16.2.2). 

• There are several recent papers [25-27] in the sparse 
recovery literature in which it is shown that a sparse, 
non-negative vector can be recovered from few linear 
measurements n p. In [12], these results are 
extended to a noisy setup. More specifically, it is 
shown that NNLS + thresholding can consistently 
recover /?* and its support. Very recently, using 
similar conditions as in [12], Meinshausen [28] has 
established several guarantees of NNLS in a 
high-dimensional setup. 

One should bear in mind that the non-negativity con- 
straints are essential for our approach. Thresholding the 

unconstrained ordinary least squares estimator /$ in 
general leads to poor results in the large p situation. 

Shortcomings ofl-i -regularization in theory 

In [12], it is not only shown that NNLS + thresh- 
olding is a sound strategy to perform sparse recovery 



of a non-negative target, but also examples are given 
where the non-negative lasso is outperformed even if 
its regularization parameter is set to match theoretical 
results and regardless of whether subsequent threshold- 
ing as advocated in [29,30] is used or not. In partic- 
ular, inferiority of the lasso arises in the presence of 
small, yet significantly non-zero entries in /?*. These 
are specifically affected by the non-negligible bias of 
i\ -regularization [31]. It is important to note that the 
comparison in [12] does not contradict prior compar- 
isons of the lasso (aka soft thresholding) and (hard) 
thresholding for orthonormal designs (4> T 4> = /) in 
[32,33], where both approaches perform similarly well and 
non-negativity constraints are not particularly important. 
Orthonormal designs, which lead to greatly simplified 
estimation problem are not of interest in the context of 
the paper, since the template matrix O is far from being 
orthonormal. 



Shortcomings ofl<[ -regularization in practice 

The study in [12] is of more theoretical nature, since 
all constants of the problem, in particular the noise 
level, are known, so that the regularization parame- 
ter can be set in an optimal fashion. This can realis- 
tically not be accomplished in practice. Likewise, the 
information-theoretic criterion employed in [3] as well 
as the data-splitting approach of [34] rely on knowl- 
edge of the noise level, or a consistent estimate thereof, 
which is hard to obtain in the large p situation [35]. In 
any case, the regularization parameter remains a quan- 
tity that is hard to grasp and hence hard to set for a 
practitioner, since it cannot be related directly to the sig- 
nal. In contrast, the threshold t admits a straightforward 
interpretation. 

Moreover, when using i\ -regularization, data fitting and 
model selection are coupled. While this is often regarded 
as advantage, since model selection is performed automat- 
ically, we think that it is preferable to have a clear sepa- 
ration between data fitting and model selection, which is 
a feature of our approach. Prior to thresholding, the out- 
put of our fitting approach gives rise to a ranking which 
we obtain without the necessity to specify any parameter. 
Selection is completely based on a single fit simply by let- 
ting the the threshold vary. On the contrary, if one wants 
to reduce the number of features selected by the lasso, 
one resets the regularization parameter and solves a new 
optimization problem. Note that it is in general not pos- 
sible to compute the entire solution path of the lasso [22] 
for the MS datasets used for the present paper, where the 
dimension of O is in the ten thousands so that the active 
set algorithm of [22] is prohibitively slow. In this regard, 
model selection by thresholding is computationally 
more attractive. 
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Results and discussion 

For the assessment of the pattern picking performance, 
in total eight spectra generated by two different ioniza- 
tion methods, matrix assisted laser desorption/ionization 
(MALDI) and electrospray ionization (ESI), respectively, 
form the basis of the evaluation. While MALDI has been 
coupled to a time-of-flight (TOF) mass analyzer, ESI MS 
spectra have been recorded on both a linear ion trap 
(LTQ) and an Orbitrap mass analyzer. In addition, a series 
of spectra were prepared with the aim of investigating 
in detail the methods performance in the presence of 
overlapping peptides. 

Datasets 

For MALDI mass spectra (Additional file 1), time of flight 
mass analysis was performed; spectra were recorded on 
an ABI MALDI-TOF/TOF 4800 instrument in positive ion 
mode using a-cyano-4-hydroxy-cinnamic acid (CHCA) 
as matrix. Nanospray ESI spectra (Additional file 2) were 
measured in positive ion mode on a Thermo LTQ Orbi- 
trap Velos MS; both high resolution measurements using 
the Orbitrap mass analyzer (referred to as 'Orbitrap') and, 
alternatively, low resolution linear ion trap (IT) measure- 
ments were performed with this setup. This experiment 
has been chosen in order to demonstrate the utility of 
our method at different concentration levels, that it is 
robust with respect to changes in the data-generating pro- 
cess and that the method is capable of handling singly 
charged ions, the main form generated by MALDI MS, 
as well as higher charged ions formed in ESI MS. Tryp- 
tic digests (performed in 40 mM ammonium bicarbonate) 
of model proteins were used as analytes: bovine myo- 
globin and chicken egg lysozyme (10 and 500 fmol each) 
for MALDI-TOF experiments, and lysozyme (250 and 
1000 fmol) for ESI experiments. Disulfide bonds were 
reduced with dithiothreitol (DTT) prior to alkylation, free 
cysteine residues were alkylated by iodacetamide. No fur- 
ther sample pretreatment was performed prior to MS 
analysis. When referring to these spectra, we omit that 
tryptic digests are given: e.g., the term 'MALDI-TOF myo- 
globin spectrum (500 fmol)' means the respective tryptic 
digest. 

To demonstrate explicitly the methods ability to sepa- 
rate strongly overlapping patterns even in the case of badly 
resolved signals, 22 additional spectra have been gener- 
ated in positive ion mode on a Bruker Daltonics HCT 
Ultra Ion Trap MS with an electrospray ion source. Three 
synthetic peptides (cf. Section "Unmixing of overlaps" 
for details) with sequences corresponding to tryptic pep- 
tides from bovine serum albumin (BSA) were used as 
analytes. In each measurement two out of three peptides 
were mixed in different ratios to get overlapping pep- 
tide signals, also with different charge states. Two differ- 
ent concentrations (500 fmol//xl and 1000 fmol//xl) were 



injected into the mass spectrometer via a Cole-Parmer 
syringe pump. 

Validation strategy 

Validation of pattern picking is notoriously difficult, 
because a gold standard which is satisfactory from both 
statistical and biological points of view is missing. In this 
context, a major problem one has to account for is that 
spectra frequently contain patterns whose shape is not 
distinguishable from those of peptides, but which are in 
fact various artifacts resulting e.g. from impurities during 
sample preparation and measurement. These artifacts do 
not constitute biologically relevant information and are, 
in this sense, 'false positives! An important instance are 
signals derived from the matrix (or from matrix-clusters) 
frequently observed in MALDI MS. The pattern of these 
signals is similar to that of peptides; nevertheless, due to 
their molecular composition, which differs significantly 
from that of an average peptide, the exact masses can be 
used to exclude these signals from the data analysis. On 
the other hand, from a statistical perspective which judges 
a method according to how well it is able to detect specific 
patterns in a given dataset, a qualification as 'true positive' 
is justified. With the aim to unify these aspects, we have 
worked out a dual validation scheme. In order to reduce 
the number of artifacts, all automatically generated lists 
of candidates for peptide masses as well as the lists of a 
human expert (see below) are postprocessed by a peptide 
mass filter [36]: only peptides whose monoisotopic mass 
deviated less than 200 ppm from the closest peptide mass 
center a are used for subsequent evaluation. 

Comparison with manual annotation 

The first part investigates how well a method is able to 
support a human expert who annotates the spectra manu- 
ally. More specifically, the automatically generated lists are 
matched to the manual annotation such that an entry of 
the list (potential peptide mass) is declared 'true positive' 
whenever there is a corresponding mass in the manual 
annotation deviating by no more than A ppm. Otherwise, 
it is declared 'false positive! In order to adapt A ppm to the 
resolution of the different mass lines, we used the follow- 
ing strategy: assuming that most of the peptides will have 
a mass larger than 700 Da, we determined the spacing 
A m / Z between neighboring data points in m/z direction 
for each mass spectrum in the lower mass range. If we 
further assume that a simple manual annotation by visual 
inspection can result in a mass deviation from the 'correct' 
mass position of at most A m / Z , we can derive the following 
tolerance values: A = 100 ppm for ion trap recordings, 
A = 50 ppm in the case of MALDI-TOF recordings 13 and 
A = 20 ppm for Orbitrap data. 

As the performance of our as well as those of all com- 
peting methods depends on a threshold-like parameter 
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governing, crudely speaking, the trade-off between preci- 
sion and recall, we explore the performance for a range 
of reasonable parameter values, instead of fixing an (arbi- 
trary) value, which we believe to be little meaningful The 
results are then visualized as ROC curve, in which each 
point in the (Recall, Precision) -plane corresponds to a 
specific choice of the parameter. Formally, we introduce 
binary variables [Bi(f)} for each mass i contained in the 
list of cardinality Lit) when setting the threshold equal 
to t, where Bi(t) equals 1 if the mass is matched and 0 
otherwise, and denote by L the number of masses of the 
manual annotation. The true positive rate (recall, R), and 
the positive predictive value (precision, P) associated with 
threshold t are then denned by R(t) = p( f ) = 

2(t) ' ^ n ROC curve results from a sequence of pairs 
[R(t),P(t)} for varying t. 

Database query 

The second part evaluates the lists in terms of a query to 
the Mascot search engine [37], version 2.2.04. In particu- 
lar, we account for a major problem of a manual annota- 
tion, namely that peptides yielding weak MS signals might 
easily be overlooked, but might be detected by methods 
designed to extract those weak signals. Since we are espe- 
cially interested in demonstrating the methods ability to 
separate overlapping patterns, we adapted the standard 
search parameters of Mascots peptide mass fingerprint 
routine to allow two missed cleavage sites and to incorpo- 
rate the following (variable) post-translational modifica- 
tions: 'Oxidation (M)', 'Carbamidomethyl (C)', Amidated 
(Protein C-term)', 'Deamidated (NQ)! In particular, the lat- 
ter two modifications will frequently trigger MS signals 
interleaving with the pattern of their unmodified coun- 
terpart: in the case of a deamidation the modified ion 
shows a mass of approx. 0.98 Da more compared to the 
amidated peptide. The same mass tolerances as for the 
manual annotation are used. As for the comparison with 
the manual annotation, we evaluate several lists corre- 
sponding to different choices of the threshold. Instead of 
an ROC curve, which turned out to be visually unpleas- 
ant, we display the statistics (score, coverage and fraction 
of hits) of two lists per method, namely of those achiev- 
ing the best score and the best coverage, respectively. The 
complete set of results as well as further details of our 
evaluation like the manual annotation are contained in 
Additional file 3. 

Competing methods 

We compare our method in its two variants depending on 
the choice of the fitting criterion (cf. Eq. (7)), labelled l\ 
(q = 1) and I2 (q = 2), respectively, with the following 
competing methods. 



Lasso 

The lasso' method in this paper serves as surrogate for 
NITPICK. Since the lasso' is embedded into our frame- 
work while implementing a methodology that closely 
resembles NITPICK, we use the lasso' for the sake 
of convenience, to avoid an involved parameter opti- 
mization for NITPICK. Our lasso implementation ben- 
efits from the improved merging procedure of Section 
"Postprocessing and thresholding". To accomodate a het- 
erogeneous noise level, NITPICK divides spectra into 
bins. This can be avoided by determining a minimizer 
W) of the weighted non-negative lasso problem 

min||v- + A.1 T W7?, A>0, (11) 

where W is a diagonal matrix with entries w C) j = 
LNL + (m C) y), j = 1, . . . ,p c , c = 1, . . . , C, whose purpose 
is to re-scale the amount of €i-regularization according to 
the local noise level. The columns of the template matrix 
$ in (11) are normalized to unit Euclidean norm as it is 
standard in the literature on the lasso. A grid search over 
50 values for X is performed, where the construction of the 
grid follows [38]. Different lasso lists are obtained for each 
active set A(X k ) = {c,j : W) > 0}, k = 1, . . . , 50, 

which are subsequently merged (see Eq. (8)). The param- 
eter X here plays the role of the threshold £, cf. Section 
"Validation strategy". 

Pepex 

As discussed in Section "Template fitting" pepex performs 
centroiding and de-isotoping separately. De-isotoping is 
based on non-negative least squares. Since pepex is lim- 
ited to detect patterns of charge state one, its performance 
is only assessed for MALDI-TOF spectra. Accordingly, 
when comparing the ouptput of pepex with the man- 
ual annotation, the few patterns of charge state two are 
excluded. The parameters nm, pf t, mined, maxed and 
nsam were set to optimize performance with respect to 
manual annotation. The ROC curves are based on peak- 
lists resulting from ten different choices of the signal-to- 
noise parameter snr. 

Isotope wavelet 

As opposed to our method, this approach is not able to 
handle overlaps. On the other hand, it typically shows 
strong performance in noisy and low intensity regions or 
on datasets with extremely low concentrations [39,40]. 
While the isotope wavelet is not limited to charge one, it is 
run in charge one only mode for the MALDI-TOF spectra, 
to achieve more competitive performance. For the sake of 
fair of comparison, the result of the isotope wavelet on 
the MALDI-TOF spectra are evaluated in the same way as 
those of pepex. 
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Vendor 

The parameter setting for the ABI MALDI-TOF/TOF MS 
software was as follows: Local Noise Width (m/z) 250, 
Min Peak Width at FWHM 2.9. The Cluster Area Opti- 
mization S/N threshold has been dynamically adapted to 
about three times the S/N threshold as suggested by the 
ABI documentation. Since the vendor software is limited 
to charge one, its outputs are evaluated in the same way 
as those of pepex. Given the disproportionally high effort 
needed to find an optimal parameter setting of the vendor 
software for ESI spectra, its performance is not assessed. 

Results 

Manual annotation vs. database query 

When inspecting Figures 4 and 5 on the one hand and 
Table 1 on the other hand, one notices that results of the 
evaluation based on the manual annotation are not in full 
accordance with the results of the database query. The dif- 
ference is most striking for the MALDI-TOF spectra at 
500 fmol, where our methods {l\ and I2) yield a significant 



improvement, which does not become apparent from the 
database query. This is because only a fraction of the 
manual annotation is actually confirmed by the database 
query. The part which is not matched likely consists of 
artifacts due to contamination or chemical noise as well 
as of specific modifications not captured by the database 
query. In light of this, our dual validation scheme indeed 
makes sense. 

Comparison 

Figure 4 and Table 1 reveal an excellent performance of 
our methods {l\ and I2) throughout all MALDI-TOF spec- 
tra under consideration. For the myoglobin spectra high 
sequence coverages are attained that clearly stand above 
those of competing methods. For the spectra at 10 fmol, 
only the performance of lasso is competetive with that 
of our methods in terms of the Mascot score; all other 
competitors, including the vendor software which has 
been tailored to process these spectra, are significantly 
weaker. In particular, the strikingly high proportion of 



Myoglobin 500 fmol 



Lysozyme 500 fmol 



o 



o 



CO w 

13 

C 

CD _ 

£ S 

g 

'(/> <N 

o ° 

CD 




l 2 

lasso 
vendor 

isotope wavelet 
pepex 



0.0 0.2 0.4 0.6 0.8 

Recall (manual annotation) 



1.0 



c 
o 

3 
o 
c 
c 

CD 

03 
=3 



g 

'c/> <n 

o ° 

CD 




11 
l 2 

lasso 
vendor 
isotope wavelet 
pepex 



0.0 0.2 0.4 0.6 0.8 

Recall (manual annotation) 



1.0 



Myoglobin 10fmol 



Lysozyme 10 fmol 



o 

3 
o 
c 

CD 
"CD 

CD 

g 
'o 




11 
l 2 

lasso 
vendor 
isotope wavelet 
pepex 



0.0 0.2 0.4 0.6 0.8 

Recall (manual annotation) 



1.0 



c 
o 



03 
O 



CD 

c 

CD 

c 
g 

'o 

CD 




11 
l 2 

lasso 
vendor 

isotope wavelet 
pepex 



0.2 0.4 0.6 0.8 

Recall (manual annotation) 



Figure 4 Results for pattern picking, MALDI-TOF. Pattern picking performance for the MALDI-TOF spectra as described in Section "Datasets". The 
points in the (Recall, Precision)-plane correspond to different choices of a method-specific threshold(-like) parameter. 
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Figure 5 Results for pattern picking, ESI. Pattern picking performance for the ESI spectra as described in Section "Datasets". The points in the 
(Recall, Precision)-plane correspond to different choices of a method-specific threshold(-like) parameter. 



'hits' (> 94%) indicates that even at moderate concen- 
tration levels, our methods still distinguish well between 
signal and noise. This observation is strongly supported 
by the ROC curves in Figure 4, where the precision drops 
comparatively slowly with increasing recall. In this regard, 
our methodology clearly contrasts with approaches like 
the isotope wavelet that aim at achieving high protein 
sequence coverage. The latter often requires the selec- 
tion of extremely lowly abundant peptide signals hidden 
in noise at the expense of reduced specificity. 

For MALDI-TOF spectra at high concentration lev- 
els, pepex achieves the best scores and is competitive 
with respect to sequence coverage. However, the perfor- 
mance of pepex degrades dramatically at lower concen- 
tration levels, as it is unambiguously shown by both parts 
of the evaluation. In particular, the database scores are 
the worst among all methods compared. This provides 
some support for our reasoning at the end of Section 
"Template fitting". 



For the ESI spectra, our methods in total fall a bit 
short of the lasso (particularly for the ion trap spec- 
tra), but perform convincingly as well, thereby demon- 
strating that they can deal well with multiple charge 
states. This is an important finding, since the presence 
of multiple charges makes the sparse recovery prob- 
lem as formulated in model (1) much more challenging, 
because the number of parameters to be estimated as 
well as the correlations across templates are increased. 
In spite of these difficulties, Figure 5 and Table 1 sug- 
gest that the performance of our pure fitting approach 
(7) does not appear to be affected. Using a more dif- 
ficult set of spectra, the capability to process ESI data 
with impressive success is additionally shown in the next 
section. 

Additional remarks 

• In Figure 4, the area under the curve (AUC) of our 
methods attained for myoglobin is higher for lower 
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Table 1 Mascot results 
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Table 1 Mascot results (Continued) 
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Corresponding Mascot results for the data shown in Figures 4 and 5. The left halves of the tables report the statistics when choosing the threshold(-like) parameter to 
optimize the score, the right halves when optimizing the coverage (cvrg, given as fraction). The column 'hits' contains the fraction of masses that could be matched to 
peptide masses in the database. 



concentration. At first glance, this may seem 
contradictory since an increase in concentration 
should lead to a simplified problem. However, a 
direct comparison of the AUCs is problematic, since 
the number of true positives (17 at 10 fmol, 106 at 
500 fmol) is rather different. For instance, there are 
choices of the threshold that yield 18 true positives 
and not a single false positive for both of our methods 
at 500 fmol, yet the AUC is lower. 
• The fact that some of the ROCs start in the lower left 
corner results from outputs containing only false 
positives. 

Unmixing of overlaps 
Motivation 

One of the main advantages of our method over more 
simplistic pattern picking methods is the ability to dis- 
entangle isotope patterns of overlapping peptide signals, 
whose presence may lead to a significantly more challen- 
ing pattern picking problem as e.g. discussed in [41] in 
the slightly different context of intact protein mass spec- 
tra. Therefore, a potential application for our approach 
will be the analysis of a certain class of posttranslational 
modifications, the deamidation of amino acid residues 
containing a carboxamide side chain functionality. The 
deamidation of asparagine (Asn) or glutamine (Gin) 
residues, yielding aspartic acid (Asp) or glutamic acid 
(Glu) residues, respectively, is an important posttransla- 
tional modification, which can have immense effects on 
the structure of peptides [42] and is of great relevance 
in a number of pathophysiological events [43]. During 
the deamidation, the side chain carboxamide is hydrol- 
ysed, which is accompanied by a mass increase of 0.98 
Da. Thus, in a spectrum of a mixture of the amidated and 
deamidated form, a direct overlap of both signals can be 
observed. It has to be noted that additionally to the ami- 
dated/deamidated forms, in case of Asn deamidation, a 
second product containing an iso-peptide bond is formed, 
too, which has the same molecular behaviour; these 
two forms can be identified solely by their differential 
MS/MS behavior. 



Results 

The peptides analyzed here in order to assess the per- 
formance of our approach were synthesized by means 
of Fmoc-solid phase peptide synthesis; sequences corre- 
sponding to tryptic peptides from bovine serum albumin 
(BSA) with the sequences listed in Table 2 were selected. 

In each measurement two out of the three listed pep- 
tides were mixed together in different ratios (Additional 
file 4). Given such a spectrum, we study the question 
whether our method returns the true underlying compo- 
sition. We classify the output of our method as correct 
interpretation of the spectrum if the templates corre- 
sponding to the true underlying peptides achieve signal- 
to-noise-ratios of at least one and these ratios are the 
two largest among all templates used for fitting. This pro- 
cedure corresponds to a selection-optimal choice of the 
threshold based on the knowledge of the true composition 
of the spectrum. This simplification may be justified in 
view of the extreme difficulty of the problem as illustrated 
in Figure 6, in particular in view of lowly resolved spectra 
with an average m/z-spacing of 0.06 Da. For the remaining 
parameters, we compare a grid search (performed sepa- 
rately for each spectrum) and the default parameter set 
(Section "Finding a set of default parameters")- Table 3 
and Figure 6 indicate that already the default parameter 
setting is able to solve successfully a wide range of prob- 
lem instances. As one would expect, Table 3 and Figure 6 
suggest that the higher the concentration and the more 
balanced the amplitudes of the overlapping peptides, the 
more likely it is that the overlap can be resolved. On the 
other hand, the higher the degree of overlap of the pep- 
tides, which depends on both their charges and the dis- 
tance of their positions, the more difficult the problem is. 



Table 2 Peptides mixed together 



Sequence 


Sequence residue 


Monoisotopic mass 




no. in BSA 


(protonated) / charge 


GACLLPK 


1 98-204 


35 1.20437/ +2 


CCTKPESER 


460-468 


35 1.4881 6/ +3 


VLASSAR 


212-218 


352.20850/ +2 
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351 .2[z = 2]/352.2[z = 2] - ratio: 1:1 - conc:1000 



351 .2[z = 2]/352.2[z = 2] - ratio: 1:5 - conc:1000 
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Figure 6 Unmixing of overlap. Graphical representation of selected overlap problems as tabulated in Table 3. The experimental setups are given 
in the title of the plots. The dots represent the signal, while solid and dashed lines represent the templates used by our method to match the signal, 
using the default parameter setting and the choice for the threshold as explained in the text. The grey area represents the overall fit when using the 
selected templates. 
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Table 3 Unmixing of overlaps 



Peptides 




351.2(2)/352.2(2) 




351.4(3)/352.2(2) 






351.2(2)/351.4(3) 




Proportion 


1:1 


1:5 5:1 


1:1 


1:5 5:1 


10:1 


1:1 


1:5 5:1 


1:10 


fmol 


500 


X 


X 




X 


X 






X 


all 1000 


X 


X X 


X 


X X 


X 


X 


X 




500 




x — 




x — 










default 1000 


X 


X 


X 


- X 


X 




X 





Results of the analyses of the series of spectra containing two overlapping target peptides. The first column contains the m/z positions and the charges (in brackets) of 
the two peptides. The two middle columns indicate whether the corresponding problem is successfully solved ('x') or not ('— ') when optimizing over a grid of 
parameter combinations (column 'all') and when using the default parameter set (column 'default'). 



This becomes obvious when considering the overlap of the 
two peptides located at 351.2 and 351.4 Da, respectively. 

Conclusion 

We have proposed a template matching approach for 
feature extraction in proteomic mass spectra. The main 
methodological innovation is a framework for sparse 
recovery in which sparsity is not promoted explicitly by 
a regularization term, as it is usually done and was done 
in previous work. We fully exploit the strength of non- 
negativity constraints, which permits us to circumvent 
the delicate choice of a proper' amount of regulariza- 
tion, an ever-lasting problem in statistics, and to work 
with thresholding instead. The latter is not only com- 
putationally attractive, because one does not have to 
repeatedly solve the same optimization problem for dif- 
ferent choices of the regularization parameter, but also 
increases user-friendliness, since the threshold is directly 
related to the signal-to-noise ratio, the quantity domain 
experts are interested in. The replacement of a regu- 
larization parameter by a threshold is a cornerstone in 
our conceptual design guided by the principle to relieve 
the user from laboursome fine tuning of parameters. We 
believe that a small set of well-interpretable parameters 
with suitable defaults additionally improves robustness 
and reproducibility of results. In this context, we would 
like to emphasize again that apart from the threshold, 
the user does not have to specify any parameters before 
running our software. 

In a comprehensive experimental study involving ins- 
truments of varying resolution and spectra of varying 
concentration levels, where we comparatively assess the 
performance of our approach on the basis of an elaborate 
dual validation scheme, it is demonstrated that the per- 
formance for pattern picking is excellent for MALDI-TOF 
spectra and outstands due to its specificity in selecting sig- 
nal and only little noise. A major strength of the method is 
its ability to unmix overlapping peptide signals as shown 
for a series of ESI spectra. In total, we demonstrate that 
our approach is broadly applicable to a variety of spectra. 
While our approach is guided by a concrete application 



in proteomics, the framework is general enough to be of 
much of use for related deconvolution problems emerging 
in other fields — only the templates have to be adjusted 
according to the specific application. 

While in this paper, we have focused on single spec- 
tra, the approach can be extended to process whole LC- 
MS runs, as it has already been implemented in our R 
package IPPD. More precisely, the sweep line scheme 
of [44] is used to agglomerate the results from sin- 
gle spectra. To apply our methods on a routine basis, 
an improved implementation, notably parallelization, is 
required, since e.g. processing a single spectrum of the 
Maxquant datasets [2] takes 10s on average on a Unix 
system equipped with an Intel(R) Core(TM)2 Duo CPU 
T9400 (2.53GHz) and 4 GB main memory. There is much 
room for an improvement, since our implementation is 
based on interpreted R code. 

Concerning future directions of research, a question we 
have not yet answered in a satisfactory way is the choice of 
the fitting criterion. While both criteria (least squares and 
least absolute deviation) employed in this paper perform 
well, their implicit assumption of additive noise might 
be questionable [45]. It is worth investigating whether a 
multiplicative noise model could even yield better results. 
Second, one might ask whether the performance could 
be further improved when it is used jointly with the iso- 
tope wavelet, which is affected by overlaps, but has the 
potential to achieve higher protein sequence coverage. 

Endnotes 

a Monoisotopic peptide mass centers are modelled by: 
1.000485 -m n +0.029, where m n denotes the nominal mass. 
b For the MALDI-TOF lysozyme datasets an extended 
search tolerance of lOOppm was applied due to experi- 
mental miscalibration of the MS. 

Appendix 

Fitting with non-negativity constraints 

In the following, we provide the details concerning opti- 
mization problem (7). In view of the special structure of O, 
(7) is computationally tractable even if n and the number 
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of templates are in the ten thousands. We exploit the 
sparsity of the problem arising from templates which are 
highly localized, i.e. the domain on which they are numer- 
ically different from zero covers only a small part of the 
whole m/z range of the spectrum. As a consequence both 
4> and the Gram matrix O t 4>, which is crucial in the com- 
putation, can conveniently be handled by using software 
for sparse matrices. For R, such software is available in the 
Matrix package [46]. 

Non-negative least squares 

Consider the quadratic program 



mm-\\y-*fi\\i 
subject to P > 0. 



(12) 
(13) 



In order to solve (12), we use the so-called log-barrier 
method which amounts to solving a sequence of an 
unconstrained nonlinear convex problems in which the 
constraints > 0), j = 1, are taken into 
account by incorporating log-barrier terms — log(j6y)/y 
in the objective. As y —> oo, the log-barrier acts like a 
function which equals +oo if < 0 and zero otherwise. 
Beginning with a moderately sized starting value for y, we 
solve the convex problem 



1 

min - \\y - 



1 p 



(14) 



using Newton's method. The gradient and Hessian with 
respect to ft, respectively, are given by 

Vp = -<l> T (y-<t>p)--[l/p 1 ... l/p p ] T . 
y 

V} = 4> T 4> + -diag(l/^ 2 , . . . , l/p 2 p ). 



Complexity analysis of non-negative least squares 

We here provide the order of magnitude of floating points 
operations (flops) required per update (i.e. per Newton 
step) for the specific non-negative least squares prob- 
lems considered for this paper. In our implementation, we 
exploit that the templates contained in the matrix 4> are 
highly localized. As a result, after a suitable column per- 
mutation, the matrix 4> t O is roughly a band matrix with 
bandwidth k no larger than only few hundreds. The dom- 
inant operation is solving the linear system = — 
with the help of the Cholesky factorization, which can be 
done in 0(pk 2 ) flops (e.g. [20], p.670). Our algorithm ter- 
minates after usually no more than one hundred Newton 
steps. 

Non-negative least absolute deviation 

Consider the optimization problem 



min \\y — <I>/?| 



(15) 



subject to P > 0. (16) 
Problem (15) can be recast as the following linear pro- 



gram. 



min r T 1 



subject to 
O/? -y + r > 0, 
y- O/? +r > 0, 
r > 0, 
P > 0. 



(17) 



For its solution, we use the log-barrier method sketched 
in the previous paragraph. After incorporating log-barrier 
terms for all constraints, the objectives of the uncon- 
strained convex problems are of the form 



The Newton descent direction dp is obtained from the 
linear system 

V}d fi = -V„. 

Solution of linear systems of this structure constitutes 
the main computational effort to be made. Fast solutions 
are obtained by using CHOLMOD [47], which offers an 
efficient implementation for computing the Cholesky fac- 
torization of sparse symmetric, positive definite matrices. 
Since the diagonal of changes from one Newton iter- 
ation to the next, one Cholesky factorization has to be 
performed per Newton step. Once we have solved (14) for 
a specific y, we solve a new problem of the type (14) for 
y • M, M > 1. This is repeated until y exceeds a pre- 
defined maximum value. For a thorough account on the 
log-barrier method, we refer to [20]. 



where we have used the notational shortcuts 

|- =yi-(*P)i + r h i = l,...,n. 
The gradients w.r.t. r and f}, respectively, are given by 

1 1 



V r = 1 



+n) "' +r n )_ 

V = --(<& T ([ E+p 1 -[ H-]" 1 )l+[ 1/h . . . l/fipV ), 
Y 

S ± =diag(§ 1 ± ,...,^ ± ). 

Introducing R = diag(ri, . . . , r n ) and B = 
diag(^i, . . . , ftp), the Hessian is given by the block matrix 
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v r 2 v r/s 



i([S+]- 2 +[3-]- 2 + J R- 2 ) 



i([S+]- 2 <l>-[H-]- 2 *) 



i($ T [ H+]" 2 -$ T [ H-]" 2 ) ^($ T ([ 3 + ]" 2 +[ S-]" 2 )$) +B" 2 ) 













dp _ 







The linear system for the Newton descent directions 
reads 

r v 2 v rP 

Note that V;? is diagonal, so it is a cheap operation to 
resolve for d r once ^ is known: 

^r = -(V r 2 )- 1 (V rj8 ^+V r ). 

Plugging this into the second block of the linear system, 
one obtains 

- V r T ^ V r) _1 (Vr^ dp + V r ) + Vjdfi = 

which is equivalent to 

In order to solve the linear system, we proceed as for 
non-negative least squares. The computational cost of this 
operation is roughly the same, since the sparse struc- 
ture of <P T 4> can still be exploited. For non-negative least 
squares, re-computation of the Hessian only involves 
a diagonal update, an operation of negligible compu- 
tational cost. However, for non-negative least absolute 
deviation, computation involves the matrix multiplica- 
tion (O t ([ E + ] -2 +[ E - ] -2 )4>), i.e. essentially a repeated 
computation of a scaled Gram matrix. In spite of the spe- 
cial structure of <P T 4>, the computational cost is of the 
same order as the solution of the linear system even when 
using a self-written routine for matrix multiplication tai- 
lored to the specific structure. 
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